In this project, we build a movie recommendation system for customers using the data set that Netflix released for its Netflix Prize competition, which contains ratings recorded through 2005. We apply suitable data pre-processing, exploration, and visualization methods to help readers of this report follow the workflow of our project, which leads to the development of our models, complemented by error analysis and suggestions for improvement.
First, we load the packages needed to develop the project and fit suitable models.
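The loading chunk itself is not shown in this report. As a minimal sketch, assuming only the packages named in the startup messages that follow (tidyverse, plotly, glue, and scales, in that order), it might look like this; any further packages the project loads are not recoverable from those messages.

```r
# Assumed package list, inferred from the startup messages in this report
library(tidyverse)  # ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
library(plotly)     # interactive versions of ggplot2 charts via ggplotly()
library(glue)       # string interpolation for tooltips and labels
library(scales)     # number and percentage formatters
```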
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
##
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
##
## collapse
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
In this section, we read the data into R (in our case, through the RStudio IDE), give the columns descriptive names, and convert them to suitable types.
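The reading step is not shown here. The sketch below is one possible way to produce a four-column data frame with the default names `V1` to `V4`, assuming the raw file uses a layout in which a `movie_id:` header line is followed by `customer_id,rating,date` rows. The sample lines and the helper `parse_ratings()` are hypothetical and only illustrate the idea.

```r
# Hypothetical sample lines; the real file layout is an assumption
raw_lines <- c("1:",
               "1488844,3,2005-09-06",
               "822109,5,2005-05-13",
               "2:",
               "2059652,4,2005-09-05")

# Hypothetical helper: turn header/row lines into one row per rating
parse_ratings <- function(lines) {
  is_header <- grepl(":$", lines)
  # Carry each movie id forward onto the rating rows that follow it
  movie_id <- as.integer(sub(":$", "", lines[is_header]))[cumsum(is_header)]
  fields <- strsplit(lines[!is_header], ",", fixed = TRUE)
  data.frame(
    V1 = movie_id[!is_header],
    V2 = as.integer(vapply(fields, `[`, character(1), 1)),
    V3 = as.integer(vapply(fields, `[`, character(1), 2)),
    V4 = vapply(fields, `[`, character(1), 3),
    stringsAsFactors = FALSE
  )
}

data <- parse_ratings(raw_lines)
```

For the full file, `raw_lines` would come from something like `readLines()` on the downloaded file. With roughly 100 million rows, a faster reader may be preferable to this base R sketch.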
data <- data %>%
  rename(movie_id = V1,
         customer_id = V2,
         rating = V3,
         date = V4) %>%
  # Parse the dates and treat the rating as a categorical variable
  mutate(date = as.Date(date),
         rating = as.factor(rating))
data %>% head()
dim(data)
## [1] 100480507 4
# Count how many times each rating value occurs
data_summary <- data %>%
  count(rating)
data_summary
levels(data_summary$rating)
## [1] "1" "2" "3" "4" "5"
# Formatter for values that are already on a 0-100 scale
format_percent <- label_percent(scale = 1, accuracy = 0.01)
data_summary <- data_summary %>%
  ungroup() %>%
  na.omit() %>%
  # Share of each rating value, in percent, plus the tooltip text for the chart
  mutate(prob = n / sum(n) * 100,
         tooltip = glue("distribution: {format_percent(prob)}"))
data_summary
data_sum_graph <- ggplot(data_summary, aes(x = rating, y = prob, text = tooltip, fill = prob)) +
  geom_col(position = "identity") +
  labs(title = "Probability Distribution of Movie Ratings",
       # Report counts of distinct movies and customers, not the id vectors themselves
       subtitle = glue("For data set 1 ({n_distinct(data$movie_id)} movies, ",
                       "{n_distinct(data$customer_id)} customers, and {nrow(data)} ratings)"),
       x = "Movie rating",
       y = "Probability (%)") +
  scale_fill_gradient(low = "#e4333e", high = "#52171a") +
  theme_minimal()
ggplotly(data_sum_graph, tooltip = c("text"))
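The percentage calculation behind the chart can be checked by hand on a small sample. The sketch below uses a made-up vector of eight ratings (not taken from the Netflix data) and base R's `table()` rather than the pipeline above, so the arithmetic is easy to follow: each share is the count for that rating divided by the total, times 100.

```r
# Hypothetical ratings, used only to illustrate the arithmetic
toy_ratings <- factor(c(5, 4, 4, 3, 5, 1, 4, 2), levels = 1:5)

# Count each rating value, then convert the counts to percentages
counts <- table(toy_ratings)
share <- as.numeric(counts) / sum(counts) * 100
names(share) <- names(counts)
share

# The shares of all rating values should add up to 100
stopifnot(isTRUE(all.equal(sum(share), 100)))
```

Here rating 4 occurs three times out of eight, so its share is 37.5 percent.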